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Abstract 

The R package spikeSlabGAM implements Bayesian variable selection, model choice, 
and regularized estimation in (geo-) additive mixed models for Gaussian, binomial, and 
Poisson responses. Its purpose is to (1) choose an appropriate subset of potential covariates 
and their interactions, (2) to determine whether linear or more flexible functional forms are 
required to model the effects of the respective covariates, and (3) to estimate their shapes. 
Selection and regularization of the model terms is based on a novel spike-and-slab-type 
prior on coefficient groups associated with parametric and semi-parametric effects. 

Keywords: MCMC, P-splines, spike-and-slab prior, normal-inverse-gamma. 



1. Introduction 

In data sets with many potential predictors, choosing an appropriate subset of covariates and 
their interactions at the same time as determining whether linear or more flexible functional 
forms are required to model the relationships between covariates and the response is a chal- 
lenging and important task. From a Bayesian perspective, it can be translated into a question 
of estimating marginal posterior probabilities of whether a variable should be in the model 
and in what form (i.e., linear or smooth; as a main effect and/or as an effect modifier). 

We introduce the R (R Development Core Team 2010) package spikeSlabGAM which imple- 
ments fully Bayesian variable selection and model choice with a spike-and-slab prior structure 
that expands the approach in Ishwaran and Rao (2005) to select or deselect single coefficients 
as well as blocks of coefficients associated with specific model terms. The spike-and-slab priors 
we use are bimodal priors for the hyper- variances of the regression coefficients which result 
in a two component mixture of a narrow spike around zero and a slab with wide support for 
the marginal prior of the coefficients themselves. The posterior mixture weights for the spike 
component for a specific coefficient or coefficient batch can be interpreted as the posterior 
probability of its exclusion from the model. 

The coefficient batches selected or deselected in this fashion can be associated with a wide 
variety of model terms such as simple linear terms, factor variables, basis expansions for the 
modeling of smooth curves or surfaces, intrinsically Gaussian Markov random fields (IGMRF), 



random effects, and all their interactions. spikeSlabGAM is able to deal with Gaussian, bi- 
nomial and Poisson responses, and can be used to fit piecewise exponential models for time- 
to-event data. For these response types, the package presented here implements regularized 
estimation, term selection, model choice, and model averaging for a similarly broad class of 
models as that available in mboost (Hothorn et al. 2010) or BayesX (Brezger et al. 2005). To 
the best of our knowledge, it is the first implementation of a Bayesian model term selection 
method that: (1) is able to fit models for non-Gaussian responses from the exponential family; 
(2) selects and estimates many types of regularized effects with a (conditionally) Gaussian 
prior such as simple covariates (both metric and categorical), penalized splines (uni- or multi- 
variate), random effects, spatial effects (kriging, IGMRF) and their interactions; (3) and can 
distinguish between smooth nonlinear and linear effects. The approach scales reasonably well 
to datasets with thousands of observations and a few hundred coefficients and is available in 
documented open source software. 

Bayesian function selection, similar to the frequentist COSSO procedure (Lin and Zhang 
2006), is usually based on decomposing the additive model in the spirit of a smoothing spline 
ANOVA (Wahba et al. 1995). Wood et al. (2002) and Yau et al. (2003) describe procedures 
for Gaussian and latent Gaussian models using a data-based prior that requires two MCMC 
runs, a pilot run to obtain a data-based prior for the slab part and a second one to estimate 
parameters and select model components. A more general approach that also allows for 
flexible modeling of the dispersion in double exponential regression models is described in 
Cottet et al. (2008), but no implementation is available. Reich et al. (2009) also use the 
smoothing spline ANOVA framework and perform variable and function selection via SSVS 
for Gaussian responses. Friihwirth-Schnatter and Wagner (2010) discuss various spike-and- 
slab prior variants for the selection of random intercepts for Gaussian and latent Gaussian 
models. 

The remainder of this paper is structured as follows: Section 2 gives some background on 
the two main ideas used in spikeSlabGAM. 2.1 introduces the necessary notation for the 
generalized additive mixed model and 2.2 fills in some details on the spike-and-slab prior. 
Section 3 relates details of the implementation: how the design matrices for the model terms 
are constructed (Section 3.1) and how the MCMC sampler works (Section 3.2). Section 4 
explains how to specify, visualize and interpret models fitted with spikeSlabGAM and contains 
an application to the Pima Indian Diabetes dataset. 
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2. Background 



2.1. Generalized additive mixed models 

The generalized additive mixed model (GAMM) is a broad model class that forms a subset 
of structured additive regression (Fahrmeir et al. 2004). In a GAMM, the distribution of the 
responses y given a set of covariates Xj {j = 1, . . . ,p) belongs to an exponential family, i.e., 

= c(y,(A)exp 

with 9,4>,b{-) and c(-) determined by the type of distribution. The conditional expected value 
of the response E{y\xi, . . . , Xp) = h{r]) is determined by the additive predictor rj and a fixed 
response function h{-). 

The additive predictor 

p 

ri = r]o + X^(3u + ^fjix) (1) 

i=i 

has three parts: a fixed and known offset rjo, a linear predictor Xufiu for model terms 
that are not under selection with coefficients associated with a very flat Gaussian prior 
(this will typically include at least a global intercept term), and the model terms fj{x) = 
{fj{xi), . . . , fj{xn))^ {j = 1, . . . ,p) that are each represented as linear combinations of dj 
basis functions Bj{-) so that 

fji^) = ^PjkBjk{x) = BjPj, with Bjkix) = (^Bjkixi), . . .,BjkixnV^ 

k=i (2) 

and (3j peNMIG(uo , w, ar, K) for j = 1, . . . ,p. 
The peNMIG prior structure is explained in detail in Section 2.2. 

Components fj{x) of the additive predictor represent a wide variety of model terms, such as 
(1) linear terms {fj{x) = f^jXj), (2) nominal or ordinal covariates {f{xji) = P^^k) iff = k, 
i.e., if entry i in xj is k), (3) smooth functions of (one or more) continuous covariates (splines, 
kriging effects, tensor product splines or varying coefficient terms, e.g.. Wood (2006)), (4) 
Markov random fields for discrete spatial covariates (e.g. Rue and Held 2005), (5) random 
effects (subject-specific intercepts or slope), and (6) interactions between the different terms 
(varying-coefficient models, effect modifiers, factor interactions). Estimates for semiparamet- 
ric model terms and random effects are regularized in order to avoid overfitting and modeled 
with appropriate shrinkage priors. These shrinkage or regularization priors are usually Gaus- 
sian or can be parameterized as scale mixtures of Gaussians (e.g. Fahrmeir et al. 2010). The 
peNMIG variable selection prior used in spikeSIabGAM can also be viewed as a scale mixture 
of Gaussians. 



hjO-b{e) \ 
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2.2. Stochastic search variable selection and spike-and-slab priors 

While analyses can benefit immensely from a flexible and versatile array of potential model 
terms, the large number of possible models in any given data situation calls for a principled 
procedure that is able to select the covariates that are relevant for the modeling effort (i.e., 
variable selection) as well as to determine the shapes of their effects (e.g., smooth vs. linear) 
and which interaction effects or effect modifiers need to be considered (i.e., model choice). 
SSVS and spike-and-slab priors are Bayesian methods for these tasks that do not rely on the 
often very difficult calculation of marginal likelihoods for large collections of complex models 
(e.g. Han and Carlin 2001). 

The basic idea of the SSVS approach (George and McCulloch 1993) is to introduce a binary 
latent variable 7^ associated with the coefficients (3j of each model term so that the contri- 
bution of a model term to the predictor is forced to be zero - or at least negligibly small - 
if 7j is in one state and left unchanged if 7^- is in the other state. The posterior distribution 
of 7j can be interpreted as marginal posterior probabilities for exclusion or inclusion of the 
respective model term. The posterior distribution of the vector 7 = (71, . . . ,7p)^ can be in- 
terpreted as posterior probabilities for the different models represented by the configurations 
of 7. Another way to express this basic idea is to assume a spike-and-slab mixture prior for 
each /3j , with one component being a narrow spike around the origin that imposes very strong 
shrinkage on the coefficients and the other component being a wide slab that imposes very 
little shrinkage on the coefficients. The posterior weights for the spike and the slab can then 
be interpreted analogously. 

The flavor of spike-and-slab prior used in spikeSlabGAM is a further development based on 
Ishwaran and Rao (2005): The basic prior structure, which we call a Normal - mixture of 
inverse Gammas (NMIG) prior, uses a bimodal prior on the variance of the coefficients 
that results in a spike-and-slab type prior on the coefficients themselves. For a scalar /3, the 
prior structure is given by: 

/3|7,r2 P;i°'"iV(0,t;2) with v^=t^^, 
-i\w ""^ w/i(7) + (1 - u')/„o(7), 

2 prior 1, \ 

prior / , s 

and tt; ~ aeia.{ayj^bw). 

Ix{y) denotes a function that is 1 in x and everywhere else and vq is some small positive 
constant, so that the indicator 7 is 1 with probability w and close to zero with probability 
1 — w. This means that the effective prior variance is very small if 7 = — this is the 
spike part of the prior. The variance is sampled from an informative Inverse Gamma (r~^) 
prior with density p{x\a, b) = j^^;^"""'^^ exp (— |) . 

This prior hierarchy has some advantages for selection of model terms for non-Gaussian data 
since the selection (i.e., the sampling of indicator variables 7) occurs on the level of the 
coefficient variance. This means that the likelihood itself is not in the Markov blanket of 7 and 
consequently does not occur in the full conditional densities (FCD) for the indicator variables, 
so that the FCD for 7 is available in closed form regardless of the likelihood. However, since 
only the regression coefficients and not the data itself occur in the Markov blanket of 7, 
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inclusion or exclusion of model terms is based on the magnitude of the coefficients /3 and not 
on the magnitude of the effect BP itself. This means that design matrices have to be scaled 
similarly across all model terms for the magnitude of the coefficients to be a good proxy for 
the importance of the associated effect. In spikeSlabGAM, each term's design matrix is scaled 
to have a Frobenius norm of 0.5 to achieve this. 



A parameter- expanded NMIG prior 

While the conventional NMIG prior (3) works well for the selection of single coefficients, it is 
unsuited for the simultaneous selection or deselection of coefficient vectors, such as coefficients 
associated with spline basis functions or with the levels of a random intercept. In a nutshell, 
the problem is that a small variance for a batch of coefficients implies small coefficient values 
and small coefficient values in turn imply a small variance so that blockwise MCMC samplers 
are unlikely to exit a basin of attraction around the origin. Gelman et al. (2008) analyze this 
issue in the context of hierarchical models, where it is framed as a problematically strong 
dependence between a block of coefficients and their associated hypervariance. A bimodal 
prior for the variance, such as the NMIG prior, obviously exacerbates these difficulties as the 
chain has to be able to switch between the different components of the mixture prior. The 
problem is much less acute for coefficient batches with only a single or few entries since a small 
batch contributes much less information to the full conditional of its variance parameter. The 
sampler is then better able to switch between the less clearly separated basins of attraction 
around the two modes corresponding to the spike and the slab (Scheipl 2010, Section 3.2). 
In our context, "switching modes" means that entries in 7 change their state from 1 to vq or 
vice versa. The practical importance for our aim is clear: Without fast and reliable mixing 
of 7 for coefficient batches with more than a few entries, the posterior distribution cannot be 
used to define marginal probabilities of models or term inclusion. In previous approaches, this 
problem has been circumvented by either relying on very low dimensional bases with only a 
handful of basis functions (Reich et al. 2009; Cottet et al. 2008) or by sampling the indicators 
from a partial conditional density, with coefficients and their hypervariances integrated out 
(Yau et al. 2003). 

A promising strategy to reduce the dependence between coefficient batches and their variance 
parameter that neither limits the dimension of the base nor relies on repeated integration 
of multivariate functions is the introduction of working parameters that are only partially 
identifiable along the lines of parameter expansion or marginal augmentation (Meng and 
van Dyk 1997; Gelman et al. 2008). The central idea implemented in spikeSlabGAM is a 
multiplicative parameter expansion that improves the shrinkage properties of the resulting 
marginal prior compared to NMIG (Scheipl 2011, Section 4.1.1) and enables simultaneous 
selection or deselection of large coefficient batches. 

Figure 1 shows the peNMIG prior hierarchy for a model with p model terms: We set (3j = ctj^j 
with mutually independent aj and for a coefficient batch f3j with length dj and use a scalar 

parameter aj ^^"^ NMIG(fo, w, Or, br), where NMIG denotes the prior hierarchy given in (3). 

Entries of the vector are i. i. d. N{mjk, 1) {k = 1, . . . ,dj; j = 1, . . . ,p) with prior 

means mjk either 1 or -1 with equal probability. 
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peNMIG: Normal mixture of inverse Gammas with parameter expansion 



1 




Figure 1: Directed acyclic graph of the peNMIG prior structure. Elhpses are stochastic 
nodes, rectangles are deterministic nodes. Single arrows are stochastic edges, double arrows 
are deterministic edges. 



We write 

f3j ~ peNMIG (?;o, w,ar,br) (4) 

as shorthand for this parameter expanded NMIG prior. The effective dimension of each 
coefficient batch associated with a specific 7j and r? is then just one, since the Markov blankets 
of both 7j and Tj now only contain the scalar parameter aj instead of the vector Pj . This solves 
the mixing problems for 7 described above. The long vector ^ = {^J , . . . , ^J)""" is decomposed 
into subvectors associated with the different coefficient batches and their respective entries 
Oj {j = l,...,p) in a. The parameter w is a global parameter that influences all model 
terms, it can be interpreted as the prior probability of a term being included in the model. The 
parameter aj parameterizes the "importance" of the j-th model term, while "distributes" aj 
across the entries in the coefficient batch Pj. Setting the conditional expectation E{^jk\mjk) = 
±1 shrinks \^jk\ towards 1, the multiplicative identity, so that the interpretation of aj as the 
"importance" of the j-th coefficient batch can be maintained. 

The marginal peNMIG prior, i.e., the prior for /3 integrated over the intermediate quantities a, 
^, T^, 7 and w, combines an infinite spike at zero with heavy tails. This desirable combination 
is similar to the properties of other recently proposed shrinkage priors such as the horseshoe 
prior (Carvalho et al. 2010) and the normal-Jeffreys prior (Bae and Mallick 2004) for which 
both robustness for large values of /? and very efficient estimation of sparse coefficient vectors 
have been shown (Poison and Scott 2010). The shape of the marginal peNMIG prior is fairly 
close to the original spike-and-slab prior suggested by Mitchell and Beauchamp (1988), which 
used a mixture of a point mass in zero and a uniform distribution on a finite interval, but 
it has the benefit of (partially) conjugate and proper priors. A detailed derivation of the 
properties of the peNMIG prior and an investigation of its sensitivity to hyperparameter 
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settings is in Scheipl (2011), along with performance comparisons against mboost and other 
approaches with regard to term selection, sparsity recovery, and estimation error for Gaussian, 
binomial and Poisson responses on real and simulated data sets. The default settings for the 
hyperparameters, validated through many simulations and data examples are ar = 5,br = 
25, vo = 2.5 • 10"^. By default, we use a uniform prior on w, i.e., Uw = bw = 1, and a very flat 
r^^(10~'^, 10~^) prior for the error variance in Gaussian models. 

3. Implementation 



3.1. Setting up the design 

All of the terms implemented in spikeSlabGAM have the following structure: First, their 
contribution to the predictor rj is represented as a linear combination of basis functions, i.e., 
the term associated with covariate x is represented as f{x) = Ylk=i ^kBk{x) = B5, where 6 
is a vector of coefficients associated with the basis functions -Bfc(-) {k = 1, . . . , K) evaluated 

in X. Second, 5 has a (conditionally) multivariate Gaussian prior, i.e., 5\v'^ ^^'^ N{Q,v'^P~), 
with a fixed scaled precision matrix P that is often positive semi-definite. Table 1 gives an 
overview of the model terms available in spikeSlabGAM and how they fit into this framework. 

Formula (2) glosses over the fact that every coefficient batch associated with a specific term 
will have some kind of prior dependency structure determined by P. Moreover, if P is only 
positive semi-definite, the prior is partially improper. For example, the precision matrix 
for a B-spline with second order difference penalty implies an improper fiat prior on the 
linear and constant components of the estimated function (Lang and Brezger 2004). The 
precision matrix for an IGMRF of first order puts an improper fiat prior on the mean level 
of the IGMRF (Rue and Held 2005, ch. 3). These partially improper priors for splines and 
IGMRFs are problematic for spikeSlabGAM's purpose for two reasons: In the first place, 
if e.g., coefficient vectors that parameterize linear functions are in the nullspace of the prior 
precision matrix, the linear component of the function is estimated entirely unpenalized. This 
means that it is unaffected by the variable selection property of the peNMIG prior and thus 
always remains included in the model, but we need to be able to not only remove the entire 
effect of a covariate (i.e., both its penalized and unpenalized parts) from the model, but also 
be able to select or deselect its penalized and unpenalized parts separately. The second issue 
is that, since the nullspaces of these precision matrices usually also contain coefficient vectors 
that parameterize constant effects, terms in multivariate models are not identifiable, since 
adding a constant to one term and subtracting it from another does not affect the posterior. 

Two strategies to resolve these issues are implemented in spikeSlabGAM. Both involve two 
steps: (1) Splitting terms with partially improper priors into two parts - one associated with 
the improper/unpenalized part of the prior and one associated with the proper/penalized 
part of the prior; and (2) absorbing the fixed prior correlation structure of the coefficients 
implied by P into a transformed design matrix B associated with then a priori independent 
coefficients (3 for the penalized part. Constant functions contained in the unpenalized part 
of a term are subsumed into a global intercept. This removes the identifiability issue. The 
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R-syntax 


Description 


B 


P 


lin(x, 


degree) 


linear/polynomial trend: 
basis functions are or- 
tliogonal polynomials of 
degree 1 to degree eval- 
uated in x; defaults to 
degree= 1 


poly(x, degree) 


identity matrix 


X u V-X./ 




factor: defaults to sum- 
to-zero contrasts 


depends on contrasts 


identity matrix 




rnd (x , 




random intercept: de- 
faults to i. i. d.; i.e., cor- 
relation C = / 


indicator variables for 
each level of x 




sm(x) 




univariate penalized 
spline: defaults to cubic 
B-splines with 2'^'^ order 
difference penalty 


B-spline basis functions 


A'^ ' A'' with A'' the d*'' 
diff. operator matrix 


srf (xy) 




penalized surface estima- 
tion on 2-D coordinates 
xy: defaults to tensor 
product cubic B-spline 
with first order difference 
penalties 


(radial) basis functions 
(thin plate / tensor prod- 
uct B-spline) 


depends on basis function 


mrf (x, 


N) 


first order intrinsic 


indicator variables for re- 


precision matrix of MRF 






Gauss-Markov random 


gions in X 


defined by (weighted) ad- 






field: factor x defines the 




jacency matrix N 






grouping of observations, 










N defines the neighbor- 










hood structure of the 










levels in x 







Table 1: Term types in spikeSlabGAM. The semiparametric terms (sm(), srf (), mrf ()) 
only parameterize the proper part of their respective regularization priors (see Section 3.1). 
Unpenalized terms not associated with a peNMIG prior (i.e., the columns in in (1)) are 
specified with term type u() . 
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remainder of the unpenaHzed component enters the model in a separate term, e.g., P-splines 
(term type smO, see Table 1) leave polynomial functions of a certain order unpenalized and 
these enter the model in a separate lin()-term. 



Orthogonal decomposition The first strategy, used by default, employs a reduced rank 
approximation of the implied covariance of f{x) to construct B, similar to the approaches 
used in Reich et al. (2009) and Cottet et al. (2008): 

bmce f{x) = B6 N{0,v^BP-B^), we can use the spectral decomposition BP B = 
UDU^ with orthonormal U and diagonal D with entries > to find an orthogonal basis 
representation for COV (^f{x)^. For B with d columns and full column rank and P with 

rank d — np, where np is the dimension of the nullspace of P, all eigenvalues of COV ^/(a;)^ 

except the first d — np are zero. Now write COV ^/(a;)^ = \U+Uq] ['^^ q] t^+^o]^, where 
Uj^ is a matrix of eigenvectors associated with the positive eigenvalues in , and Uq are the 

eigenvectors associated with the zero eigenvalues. With B = Uj^d\I_'^ and /3 "^^^^ A^(0,u^/), 
j{x) = B(5 has a proper Gaussian distribution that is proportional to that of the partially 
improper prior of f{x) (Rue and Held 2005, eq. (3.16)) and parameterizes only the penal- 
ized/proper part of f{x), while the unpenalized part of the function is represented by Uq. 

In practice, it is unnecessary and impractically slow to compute all n eigenvectors and values 
for a full spectral decomposition UDU . Only the first d — np are needed for B, and of 
those the first few typically represent most of the variability in f{x). spikeSlabGAM makes 
use of a fast truncated bidiagonalization algorithm (Baglama and Reichel 2006) implemented 
in irlba (Lewis 2009) to compute only the largest d — np eigenvalues of COV ^/(cc)^ and their 
associated eigenvectors. Only the first d eigenvectors and -values whose sum represents at 
least .995 of the sum of all eigenvalues are used to construct the reduced rank orthogonal 
basis B with d columns, e.g., for a cubic P-spline with second order difference penalty and 
20 basis functions (i.e., d = 20 columns in B and np = 2), B will typically have only 8 to 12 
columns. 



"Mixed model" decomposition The second strategy reparameterizes via a decomposition 
of the coefficient vector d into an unpenalized part and a penalized part: d = X^fBu + Xp(3, 
where is a basis of the np-dimensional nullspace of P and Xp is a basis of its complement. 
spikeSlabGAM uses a spectral decomposition of P with P = [A+Aq] [^^j' g] [A+Aq]"'", where 
A_(. is the matrix of eigenvectors associated with the positive eigenvalues in F^-, and Aq 
are the eigenvectors associated with the zero eigenvalues. This decomposition yields Xu = 
Aq and Xp = L{L~^ L)~^ with L = A+r^^. The model term can then be expressed as 
BS = B(X^P^ + Xp(3) = Bu(3u + Bp with B u as the design matrix associated with the 
unpenalized part and B as the design matrix associated with the penalized part of the term. 
The prior for the coefficients associated with the penalized part after reparameterization is 
then P ~ N{0,v'^I), while /3„ has a flat prior (c.f. Kneib 2006, ch. 5.1). 
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Interactions Design matrices for interaction effects are constructed from tensor products 
(i.e., column- wise Kronecker products) of the bases for the respective main effect terms. For 
example, the complete interaction between two numeric covariates xi and X2 with smooth 
effects modeled as P-spIines with second order difference penalty consists of the interactions 
of their unpenalized parts (i.e., linear xi-linear X2), two varying-coefficient terms (i.e., smooth 
xix linear X2, linear xix smooth X2) and a 2-D nonlinear effect (i.e., smooth xix smooth 
X2). By default, spikeSlabGAM uses a reduced rank representation of these tensor product 
bases derived from their partial singular value decomposition as described above for the 
'"orthogonal" decomposition. 



"Centering" the effects By default, spikeSlabGAM makes the estimated effects of all 
terms orthogonal to the nullspace of their associated penalty and, for interaction terms, 
against the corresponding main effects as in Yau et al. (2003). Every B is transformed 
via B B {I - Z{Z^ Z)-^Z^). For simple terms (i.e., fct(), lin(), rndO), Z = 1 and 
the projection above simply enforces a sum-to-zero constraint on the estimated effect. For 
semi-parametric terms, Z is a basis of the nullspace of the implied prior on the effect. For 
interactions between d main effects, Z = [1 Bi B2 ■ ■ ■ B^], where Bi, . . . , B^ are the design 
matrices of the involved main effects. This centering improves separability between main 
effects and their interactions by removing any overlap of their respective column spaces. All 
uncertainty about the mean response level is shifted into the global intercept. The projection 
uses the QR decomposition of Z for speed and stability. 



3.2. Markov chain Monte Carlo implementation 

spikeSlabGAM uses the blockwise Gibbs sampler summarized in Algorithm 1 for MCMC 
inference. The sampler cyclically updates the nodes in Figure 1. The FCD for a is based on 
the "collapsed" design matrix Xa = X blockdiag(^i, . . . ,^p), while ^ is sampled based on a 
"rescaled" design matrix = X blockdiag(l(ii, . . . , ldp)ct, where is a d x 1 vector of ones 
and X = [Xu Bi . . . Bp] is the concatenation of the designs for the different model terms (see 
(1)). The full conditionals for a and ^ for Gaussian responses are given by 

Q;| • ~ N{fj,a, ^a) with 

5]„ = (\xJXa + diag (-/t'^) M , fij = ^E^Xjy, and 
$\ - ~ Af(/i5,S;g) with 

= [l^I^i + ^) ; /^.- = [l^Jy + ^) ■ 

For non-Gaussian responses, we use penalized iteratively re- weighted least squares (P-IWLS) 
proposals (Lang and Brezger 2004) in a Metropolis-Hastings step to sample a and ^, i.e., 
Gaussian proposals are drawn from the quadratic Taylor approximation of the logarithm of 
the intractable FCD. Because of the prohibitive computational cost for large q and p (and low 
acceptance rates for non-Gaussian response for high-dimensional IWLS proposals), neither a 
nor ^ are updated all at once. Rather, both a and ^ are split into ba (b^) update blocks that 
are updated sequentially conditional on the states of all other parameters. 
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Algorithm 1 MCMC sampler for peNMIG 



Initialize (via IWLS for non- Gaussian response) 

Compute a(0),|(0),xi°^ 

for iterations t = 1 , . . . , T do 

for blocks b = 1, . . . ,ba do 

update q[*^ from its FCD (Gaussian case, see (5))/ via P-IWLS 

set ) = X blockdiag(ld, , . . . , ld,W^ 

update mf \ ...,mq^ from their FCD: P{rnp = 1|-) 

for blocks h = 1, . . . , 6g do 

update from its FCD (Gaussian case, see (5))/ via P-IWLS 
for model terms j = 1, . . . ,p do 
rescale ^j*^ and 



l+exp(-2^p)) 



set X^a^ = X blockdiag(^f \ • • • , 4*^) 

/ 2{t) 

update ri2W,...,rp2W from their FCD: rf*^|- ~ T'^ + 1/2, br + ^ 



{*) 



update 7iW,...,7p« from their FCD: ^^^^^ = ^o^'exp (^^) 

update from its FCD: ~ Beta (a^ + /i(7f ), + ^0(7^! 

if y is Gaussian then 

update from its FCD: (^(*)|- ~ F"! (^a^ + n/2, 6^ + Si^^i_:!£i)!^ 



By default, starting values (3^^^ are drawn randomly in three steps: First, 5 Fisher scoring 
steps with fixed, large hypervariances are performed to reach a viable region of the parameter 
space. Second, for each chain run in parallel, Gaussian noise is added to this preliminary /3^'^\ 
and third its constituting p subvectors are scaled with variance parameters 'JjtJ {j = 1, . . . ,p) 
drawn from their priors, this means that, for each of the parallel chains, some of the p model 
terms are set close to zero initially, and the remainder is in the vicinity of their respective ridge- 
penalized MLEs. Starting values for a^^'^ and ^^^^ are then computed via a^^^ = dj^ Ylt^ If^ji'^l 
and = /a"^\ Section 3.2 in Scheipl (2011) contains more details on the sampler. 



4. Using spikeSlabGAM 



4.1. Model specification and post-processing 

spikeSlabGAM uses the standard R formula syntax to specify models, with a slight twist: 
Every term in the model has to belong to one of the term types given in Table 1. If a model 
formula contains "raw" terms not wrapped in one of these term type functions, the package will 
try to guess appropriate term types: For example, the formula y ~ x + f with a numeric 
X and a factor f is expanded into y ~ lin(x) + sm(x) + fct(f) since the default is to 
model any numeric covariate as a smooth effect with a lin()-term parameterizing functions 
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from the nullspace of its penalty and an sm()-term parameterizing the penahzed part. The 
model formula defines the candidate set of model terms that comprise the model of maximal 
complexity under consideration. As of now, indicators 7 are sampled without hierarchical 
constraints, i.e., an interaction effect can be included in the model even if the associated main 
effects or lower order interactions are not. 

We generate some artificial data for a didactic example. We draw n = 200 observations from 
the following data generating process: 

• covariates sml , sm2, noise2, noiseS are ''~ ' [/[0, 1], 

• covariates f , noise4 are factors with 3 and 4 levels, 

• covariates linl , lin2, lin3 are ' A^(0, 1), 

• covariate noisel is collinear with sml: noise 1 = sml + e; a ~ N{0, 1), 

• r] = /(sml) + /(sm2, f ) + 0.1 • linl + 0.2 • lin2 + 0.3 • lin3 (see Figures 4 and 5 for the 

shapes of the nonlinear effects /(sml) and /(sm2,f)), 

• the response vector y = ri+ ^'^^^ e is generated under signal-to- noise ratio snr = 3 with 
i.i.d. ts-distributed errors ei {i = 1, . . . ,n). 

R> set.seed(1312424) 

R> n <- 200 

R> snr <- 3 

R> sml <- runif(n) 

R> fsml <- dbeta(sml, 7, 3)/2 

R> sm2 <- runifCn, 0, 1) 

R> f <- gl(3, n/3) 

R> ff <- as .numeric (f) /2 

R> fsm2f <- ff + ff * sm2 + ((f == 1) * -dheta(sm2, 6, 4) + 
+ (f == 2) * dheta(sm2, 6, 9) + (f == 3) * dbeta(sm2, 

+ 9, 6))/2 

R> lin <- matrix (rnorm(n * 3) , n, 3) 

R> colnames(lin) <- pasteC'lin" , 1:3, sep = "") 

R> noisel <- sml + rnorm(n) 

R> noise2 <- runif(n) 

R> noise3 <- runif(n) 

R> noise4 <- sample (gl (4, n/4)) 

R> eta <- fsml + fsm2f + lin X*% c(0.1, 0.2, 0.3) 

R> y <- eta + sd (eta) /snr * rt(n, df = 5) 

R> d <- data. frame (y , sml, sm2, f, lin, noisel, noise2, 

+ noise3, noise4) 

We fit an additive model with all covariates as main effects and first-order interactions between 
the first 4 as potential model terms: 
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R> fl <- y ~ (sml + sm2 + f + linl)~2 + lin2 + lin3 + noisel + 
+ noise2 + noise3 + noise4 

The function spikeSlabGAM sets up the design matrices, calls the sampler and returns the 
results: 

R> m <- spikeSlabGAM (formula = fl, data = d) 

The following output shows the first part of the summary of the fitted model. Note that the 
numeric covariates have been split into linO- and sm() -terms and that the factors have been 
correctly identified as f ct()-terms. The joint effect of the two numerical covariates sml and 
sm2 has been decomposed into 8 components: the 4 marginal linear and smooth terms, their 
linear-linear interaction, two "varying coefficient" terms (i.e., linear-smooth interactions) and a 
smooth interaction surface. This decomposition can be helpful in constructing parsimonious 
models. If a decomposition into marginal and joint effects is irrelevant or inappropriate, 
bivariate smooth terms can alternatively be specified with a srf ()-term. Mean posterior 
deviance is ^ — 2Z(y|r7(*\ (Z)*-*^), the average of twice the negative log-likelihood of the 
observations over the saved MCMC iterations, the null deviance is twice the negative log- 
likelihood of an intercept model without covariates. 

R> summary (m) 

Spike-and-Slab STAR for Gaussian data 
Model : 

y ~ ((liii(sml) + sm(sml)) + (lin(sm2) + sm(sm2)) + fct(f) + (lin(linl) + 
sm(linl)))"2 + (liii(lin2) + sm(lin2)) + (linClinS) + sm(lin3)) + 
(lin (noisel) + sm(noisel) ) + (lin(noise2) + sm(noise2) ) + 
(lin(noise3) + sm(noise3)) + fct(noise4) - lin(sml) : sm(sml) - 
lin(sin2) : siii(sm2) - lin(linl) : siii(linl) 

200 observations; 257 coefficients in 37 model terms. 

Prior: 

a[tau] b[tau] v[0] a[w] b [w] a[sigma~2] 

5.0e+00 2.5e+01 2.5e-04 l.Oe+00 l.Oe+00 l.Oe-04 
b [sigma~2] 
1 . Oe-04 

MCMC: 

Saved 1500 samples from 3 chain(s), each ran 2500 iterations after a 
burn-in of 100 ; Thinning: 5 

Null deviance: 704 
Mean posterior devieince: 284 
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Marginal posterior inclusion probabilities and term importcince : 
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*:P(gamma=l)>.25 **:P(gamma=l)>.5 *** :P(gamma=l) > . 9 

In most applications, the primary focus will be on the marginal posterior inclusion probabil- 
ities P (gamma = 1), given along with a measure of term importance pi and the size of the 
associated coefficient batch dim. pi is defined as tTj = ffj fj^i / ff^iifj^i, where fjj is the pos- 
terior expectation of the linear predictor associated with the j*'* term, and f\^i is the linear 
predictor minus the intercept. Since Yl^'^j — 1' the pi values provide a rough percentage 
decomposition of the sum of squares of the (non-constant) linear predictor (Gu 1992). Note 
that they can assume negative values as well for terms whose contributions to the linear pre- 
dictor fjj are negatively correlated with the remainder of the (non-constant) linear predictor 
fj^i — fjj. The summary shows that almost all true effects have a high posterior inclusion prob- 
ability (i.e., lin() for lin2, lin3; lin(),sm() for sml, sm2; fct(f); and the interaction 
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Posterior model probabilities (inclusion threshold = 0.5 ): 
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Figure 2: Excerpt of the second part of the output returned by summary . spikeSlabGAM, 
which tabulates the configurations of P(7j = 1) > .5 with highest posterior probabihty. In 
the example, the posterior is very concentrated in the true model without linl, which has a 
posterior probability of 0.31. The correct model that additionally includes linl (column 6) 
has a posterior probability of 0.02. 
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terms between sni2 and f ). All the terms associated with noise variables and the superfluous 
smooth terms for linl , lin2, lin3 as well as the superfluous interaction terms have a very 
low posterior inclusion probability. The small linear influence of linl has not been recovered. 

Figure 2 shows an excerpt from the second part of the summary output, which summarizes the 
posterior of the vector of inclusion indicators 7. The table shows the different configurations 
of -P(7j = 1) > .5, j = 1, . . . ,p sorted by relative frequency, i.e., the models visited by the 
sampler sorted by decreasing posterior support. For this simulated data, the posterior is 
concentrated strongly on the (almost) true model missing the small linear effect of linl. 

4.2. Visualization 

spikeSlabGAM offers automated visualizations for model terms and their interactions, imple- 
mented with ggplot2 (Wickham 2009). By default, the posterior mean of the linear predictor 
associated with each covariate (or combination of covariates if the model contains interac- 
tions) along with (pointwise) 80% credible intervals is shown. Figure 3 shows the estimated 
effects for ml. 

Plots for specific terms can be requested with the label argument, Figures 4 and 5 show 
code snippets and their output for /(sml) and /(siii2,f). The fits are quite close to the 
truth despite the heavy-tailed errors and the many noise terms included in the model. Full 
disclosure: The code used to render Figures 4 and 5 is a little more intricate than the code 
snippets we show, but the additional code only affects details (font and margin sizes and the 
arrangement of the panels) . 

4.3. Assessing convergence 

spikeSlabGAM uses the convergence diagnostics implemented in R2WinBUGS (Sturtz et al. 
2005). The function ssGAM2Bugs() converts the posterior samples for a spikeSlabGAM-object 
into a bugs-object, for which graphical and numerical convergence diagnostics are available via 
plot and print. Note that not all cases of non- convergence should be considered problematic, 
e.g., if one of the chains samples from a different part of the model space than the others, but 
has converged on that part of the parameter space. 

4.4. Pima indian diabetes 

We use the time-honored Pima Indian Diabetes dataset as an example for real non-gaussian 
data: This dataset from the UCI repository (Newman et al. 1998) is provided in package 
mlbench (Leisch and Dimitriadou 2010) as PimaIndiansDiabetes2. We remove two columns 
with a large number of missing values and use the complete measurements of the remaining 
7 covariates and the response (diabetes Yes/No) for 524 women to estimate the model. We 
set aside 200 observations as a test set: 

fi> data("PimaIndiansDiahetes2" , package = "mlhench") 

R> pimaDiab <- na.omit(PimaIndiansDiabetes2[, -c(4, 5)]) 
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R> plot(m) 
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Figure 3: Posterior means and pointwise 80% credible intervals for ml. Interaction surfaces 
of two numerical covariates are displayed as color coded contour plots, with regions in which 
the credible interval does not overlap zero marked in blue (?? < 0) or red (r/ > 0). Each panel 
contains a marginal rug plot that shows where the observations are located. Note that the 
default behavior of plot . spikeSlabGAM is to cumulate all terms associated with a covariate 
or covariate combination. In this example, the joint effects of the first 4 covariates sml , sm2 , 
f and linl and the sums of the lin- and sm-terms associated with lin2, lin3, noisel, 
noise2 and noiseS are displayed. All effects of the noise variables are ~ 0, note the different 
scales on the vertical axes. Vertical axes can be forced to the same range by setting option 
c ommonEt aS c al e . 



17 



R> plot(m, labels = c("lin(sml) " , "sm(sml) ") , cumulative = FALSE) 
R> trueFsml <- data, frame (truth = fsml - meaii(fsml) , sml = sml) 
R> plot(m, labels = "sm(sml) " , ggElems = list(geom_line(aes(x = sml, 
+ y = truth), data = trueFsml, linetype = 2))) 




Figure 4: Posterior means and pointwise 80% credible intervals for /(sml) in ml. Left and 
middle panel show the separate linO- and sm()-terms returned by the first call to plot, 
right panel shows their sum. True shape of /(sml) added as a dashed line with the ggElems 
option of plot . spikeSlabGAM. 



R> trueFsm2f <- data. frame (truth = fsm2f - mean(fsm2f) , 
+ sm2 = sm2, f = f) 

R> plot(m, labels = "sm(sm2) :fct(f) " , ggElems = list(geom_line(aes(x = sm2, 
+ y = truth, colour = f), data = trueFsm2f, linetype = 2))) 
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Figure 5: Posterior means and pointwise 80% credible intervals for /(sm2,f) in ml. True 
shape of /(sm2|f) added as dashed line for each level of f . 
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fl> pimaDiah <- within (pimaDiah, { 

+ diabetes <- 1 * (diabetes == "pos") 

+ }) 

R> set . seed(1109712439) 

R> testind <- sampled :nrow(pimaDiab) , 200) 
R> pimaDiabTrain <- pimaDiab [-testind, ] 

Note that spikeSlabGAMO always expects a dataset without any missing values and responses 
between and 1 for binomial models. 

We increase the length of the burn-in phase for each chain from 100 to 500 iterations and 
run 8 parallel chains for an additive main effects model (if either multicore (Urbanek 2010) 
or snow (Tierney et al. 2010) are installed, the chains will be run in parallel): 

R> mcmc <- list(nChains = 8, chainLength = 1000, burnin = 500, 
+ thin = 5) 

R> mO <- spikeSlabGAM (diabetes ~ pregnant + glucose + pressure + 

+ mass + pedigree + age, family = "binomial" , data = pimaDiabTrain, 

+ mcmc = mcmc) 

We compute the posterior predictive means for the test set, and request a summary of the 
fitted model: 

R> prO <- predict (mO, newdata = pimaDiab [testind, ]) 
R> print ( summary (mO) , printModels = FALSE) 

Spike-and-Slab STAR for Binomial data 
Model : 

diabetes ~ (1 in (pregnant) + sm (pregnant) ) + (lin(glucose) + sm(glucose) ) + 
(lin (pressure) + sm(pressure) ) + (lin(mass) + sm(mass)) + 
(lin (pedigree) + sm(pedigree) ) + (lin(age) + siii(age)) 

524 observations; 58 coefficients in 13 model terms. 

Prior : 

a[tau] b[tau] v[0] a[w] b[w] 
5.0e+00 2.5e+01 2.5e-04 l.Oe+00 l.Oe+00 

MCMC: 

Saved 8000 samples from 8 chain(s) , each ran 5000 iterations after a 

burn-in of 500 ; Thinning: 5 
P-IWLS acceptance rates: 0.92 for alpha; 0.64 for xi. 

Null deviance: 676 
Mean posterior deviance: 474 
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Marginal posterior inclusion probabilities and term importance: 
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spikeSlabGAM selects nonlinear effects for age and mass and a linear trend in glucose. The 
nonlinear effect of pedigree has fairly weak support, mboost: :gamboost ranks the variables 
very similarly, based on the relative selection frequencies of the associated baselearners: 

R> h <- gamboost (as. factor (diabetes) ~ pregnant + glucose + 
+ pressure + mass + pedigree + age, family = Binomial () , 

+ data = pimaDiabTrain) [300] 

R> aic <- AIC(b, method = "classical") 

R> prB <- predict (b[mstop (aic) ] , newdata = pimaDiab[testInd, 
+ ]) 

R> summary (b[mstop (aic) ] )$selprob 

bbs(mass, df = dfbase) bbs (glucose, df = dfbase) 

0.290323 0.266129 

bbs (age, df = dfbase) bbs (pedigree, df = dfbase) 

0.209677 0.120968 

bbs (pregnant , df = dfbase) bbs (pressure, df = dfbase) 

0.072581 0.040323 

Finally, we compare the deviance on the test set for the two fitted models: 
R> dev <- function(y, p) { 

+ -2 * sum(dbinom(x = y, size = 1, prob = p, log = T)) 

+ } 

R> c (spikeSlabGAM = dev(pimaDiab [testind, "diabetes"] , prO) , 

+ gamboost = dev (pimaDiab [testind, "diabetes"] , plogis(prB) ) ) 
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spikeSlabGAM gamboost 
180.51 194.79 



So it seems like spikeSlabGAM's model averaged predictions are a little more accurate than 
the predictions returned by gamboost in this case. 

We can check the sensitivity of the results to the hyperparameters and refit the model with 
a larger vq to see if/how the results change: 

R> hyper 1 <- list (gamma = c(vO = 0.005)) 

R> ml <- spikeSlabGAM (diabetes ~ pregnant + glucose + pressure + 

+ mass + pedigree + age, family = "binomial" , data = pimaDiabTrain, 

+ mcmc = mcmc, hyperparameters = hyper 1) 

R> prl <- predict (ml, newdata = pimaDiabCtestInd, ]) 

R> print (summary (ml) , printModels = FALSE) 

Spike-and-Slab STAR for Binomial data 

Model : 

diabetes ~ (lin (pregnant) + sm (pregnant) ) + (lin(glucose) + sm(glucose)) + 
(lin (pressure) + sm(pressure) ) + (lin (mass) + sm(mass)) + 
(lin(pedigree) + sm (pedigree) ) + (lin(age) + sm(age)) 

524 observations; 58 coefficients in 13 model terms. 

Prior: 

a[tau] b[tau] v[0] a[w] b[w] 
5.000 25.000 0.005 1.000 1.000 

MCMC: 

Saved 8000 samples from 8 chain(s) , each ran 5000 iterations after a 

burn- in of 500 ; Thinning: 5 
P-IWLS acceptance rates: 0.85 for alpha; 0.64 for xi. 

Null deviance: 676 
Mean posterior deviance: 458 

Marginal posterior inclusion probabilities and term importance: 
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R> (dev(pimaDiab[testInd, "diabetes"] , prl)) 
[1] 177.85 

The selected terms are very similar, and the prediction is slightly more accurate (predictive 
deviance for mO was 180.51). Note that, due to the multi-modality of the target posterior, 
stable estimation of precise posterior inclusion and model probabilities requires more parallel 
chains than were used in this example. 

5. Summary 

We have presented a novel approach for Bayesian variable selection, model choice, and regular- 
ized estimation in (geo-)additive mixed models for Gaussian, binomial, and Poisson responses 
implemented in spikeSlabGAM. The package uses the established R formula syntax so that 
complex models can be specified very concisely. It features powerful and user friendly visu- 
alizations of the fitted models. Major features of the software have been demonstrated on an 
example with artificial data with t-distributed errors and on the Pima Indians Diabetes data 
set. In future work, the author plans to add capabilities for "always included" semiparamet- 
ric terms and for sampling the inclusion indicators under hierarchical constraints, i.e., never 
including an interaction if the associated main effects are excluded. 
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